
implement RecHit SOA and move to new framework #322

Closed

Conversation


@VinInn commented Apr 15, 2019

In this PR (on top of #312 and #318):
- TrackingRecHit is introduced as CUDAFormats;
- the RecHit producer is migrated to the new framework (#100);
- clients are migrated.
TBD: rename and clean up.

All three workflows have been tested.

The doublet builder has also been moved to use constant memory, following the Hackathon investigation (a 10% speed-up in the kernel).
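
For illustration, here is a minimal sketch of the constant-memory pattern referred to above, written as standalone CUDA; all names and values are placeholders, not the actual doublet-builder code:

    #include <cuda_runtime.h>

    // Hypothetical cut parameters kept in __constant__ memory: every thread reads
    // the same values, so the reads are served from the constant cache.
    __constant__ float c_cuts[2];

    __global__ void selectDoublets(const float* __restrict__ dr, bool* __restrict__ keep, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
        keep[i] = (dr[i] > c_cuts[0]) && (dr[i] < c_cuts[1]);
    }

    void uploadCuts() {
      const float cuts[2] = {0.01f, 0.3f};
      // copy the host values into the __constant__ symbol once, before launching kernels
      cudaMemcpyToSymbol(c_cuts, cuts, sizeof(cuts));
    }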

rovere and others added 30 commits September 3, 2018 12:26
…atrices of static dimensions to run on the GPUs
…o use matrices of static dimensions in order to run on the GPUs.
- deleted the forgotten prints and time measurements;
- created a new modifier for the broken line fit;
- switched back from tipMax=1 to tipMax=0.1 (the change may be made in another PR);
- restored the original order of the cuts on chi2 and tip;
- deleted the default label to pixelFitterByBrokenLine;
- switched from CUDA_HOSTDEV to __host__ __device__;
- BrokenLine.h now uses dynamically sized matrices (the advantage over statically sized ones is that the code also works with n > 4) and, as before, the switch can easily be made at the start of the file;
- hence, the GPU test now needs an increased stack size (at least 1761 bytes);
- some doxygen comments in BrokenLine.h have been updated.

VinInn commented Apr 15, 2019

Use
VinInn/cmssw@gpuSmartAllocDoublets...VinInn:gpuNewRecHits
to review the changes introduced by this PR.

@VinInn requested a review from makortel, April 15, 2019 12:17

@makortel left a comment


I need to take another look, but here is a first round of comments for the RecHit SOA+migration commits.



m_store16 = cs->make_device_unique<uint16_t[]>(nHits*n16,stream);
m_store32 = cs->make_device_unique<float[]>(nHits*n32+11+(1+TrackingRecHit2DSOAView::Hist::wsSize())/sizeof(float),stream);

@makortel

The arrays are not necessarily 128-byte-aligned, right? (Or am I missing something?)

@VinInn (Author)

Right, I was thinking of introducing a stride function that computes the required stride as ((n*b+127)/128)*128/b, and of using it as stride(nHits, 4) and stride(nHits, 2) consistently.
It is also true that these arrays are accessed mostly randomly, so alignment does not make much of a difference.

@makortel

Ok.
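
For illustration, a self-contained sketch of the stride helper described above; the function name and the values in the checks are assumptions, the formula is the one quoted in the comment, and per-column 128-byte alignment only holds if the base pointer is itself 128-byte aligned:

    #include <cassert>
    #include <cstdint>

    // Round n elements of b bytes each up so that every column spans a multiple of 128 bytes.
    constexpr uint32_t stride(uint32_t n, uint32_t b) {
      return ((n * b + 127) / 128) * 128 / b;
    }

    int main() {
      assert(stride(1000, 4) == 1024);  // 1000 floats  -> 1024 elements (4096 bytes)
      assert(stride(1000, 2) == 1024);  // 1000 int16_t -> 1024 elements (2048 bytes)
      return 0;
    }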


view->m_charge = (int32_t *)get32(8);
view->m_xsize = (int16_t *)get16(2);
view->m_ysize = (int16_t *)get16(3);

@makortel

Could reinterpret_cast be used here?

@VinInn (Author)

What difference does it make?

@makortel

Aesthetics (also code rules)
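
For illustration, a self-contained sketch of the pattern under discussion with the C-style casts replaced by reinterpret_cast; the buffer layout, helper lambdas and field names are placeholders, not the actual TrackingRecHit2D code:

    #include <cstdint>
    #include <memory>

    // Placeholder SoA view with the same flavour of fields as in the quoted diff.
    struct View {
      int32_t* m_charge;
      int16_t* m_xsize;
      int16_t* m_ysize;
    };

    int main() {
      constexpr int nHits = 1024;
      auto store32 = std::make_unique<float[]>(nHits * 10);    // 32-bit columns
      auto store16 = std::make_unique<uint16_t[]>(nHits * 4);  // 16-bit columns

      // helpers returning the i-th column, in the spirit of get32()/get16() in the PR
      auto get32 = [&](int i) { return store32.get() + i * nHits; };
      auto get16 = [&](int i) { return store16.get() + i * nHits; };

      View view;
      // reinterpret_cast instead of C-style casts, as suggested in the review
      view.m_charge = reinterpret_cast<int32_t*>(get32(8));
      view.m_xsize  = reinterpret_cast<int16_t*>(get16(2));
      view.m_ysize  = reinterpret_cast<int16_t*>(get16(3));

      view.m_charge[0] = 42;  // columns are then used through the typed pointers
      return 0;
    }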

@@ -9,6 +9,7 @@
#include "DataFormats/GeometrySurface/interface/SOARotation.h"
#include "Geometry/TrackerGeometryBuilder/interface/phase1PixelTopology.h"
#include "HeterogeneousCore/CUDAUtilities/interface/cuda_cxx17.h"
#include "HeterogeneousCore/CUDAUtilities/interface/cudaCompat.h"

@makortel

Is this include only for testing purposes, or really needed at the moment?

@VinInn (Author)

not strictly needed at the moment...

@VinInn (Author)

Actually, it is needed, due to the device functions....

@makortel

So the header gets included in some CPU .cc file as well? Ok.

@VinInn (Author)

Yes, there is the possibility of using this very CPE in the standard CPU workflows...

CUDAProduct<TrackingRecHit2DCUDA> const& inputDataWrapped = iEvent.get(tokenHit_);

// try to be in parallel with tracking
CUDAScopedContext ctx{iEvent.streamID(), std::move(waitingTaskHolder)};

@makortel

After #305 this comment is not really accurate, and the context should be constructed as

Suggested change
CUDAScopedContext ctx{iEvent.streamID(), std::move(waitingTaskHolder)};
CUDAScopedContext ctx{inputDataWrapped, std::move(waitingTaskHolder)};

This work and the pixel tracking will be run in separate CUDA streams.

@VinInn (Author)

ok

heterogeneous::GPUCuda,
heterogeneous::CPU
> > {
class SiPixelRecHitHeterogeneous : public edm::global::EDProducer<> {

@makortel

I'd suggest (eventually) renaming this class to SiPixelRecHitCUDA.

@VinInn (Author)

Indeed, that is in the plan.


convertGPUtoCPU(iEvent.event(), hclusters, *output);
}
gpuAlgo_.makeHitsAsync(hits,digis, clusters, bs, fcpe->getGPUProductAsync(ctx.stream()), ctx.stream());

@makortel

Alternatively makeHitsAsync() could construct and return TrackingRecHit2DCUDA.

@VinInn (Author)

Originally I thought to keep the CPU class with its storage separate in the producer, and to have the algo depend only on the View. This did not work out, as one needs more pointers on the host.
So yes, it is a possibility.
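
As a generic, self-contained illustration of the two alternatives being discussed (dummy types and names, not the actual PR classes): the algorithm can either fill a product owned by the producer, or construct and return the product itself.

    #include <vector>

    // Dummy stand-in for the SoA hit product (not TrackingRecHit2DCUDA).
    struct HitsSoA {
      std::vector<float> xLocal, yLocal;
    };

    // Option A: the caller owns the product; the algorithm only fills it.
    void fillHits(HitsSoA& hits) {
      hits.xLocal.assign(10, 0.f);
      hits.yLocal.assign(10, 0.f);
    }

    // Option B (as suggested above): the algorithm constructs and returns the product.
    HitsSoA makeHits() {
      HitsSoA hits;
      hits.xLocal.assign(10, 0.f);
      hits.yLocal.assign(10, 0.f);
      return hits;  // moved (or elided), so no extra copy
    }

    int main() {
      HitsSoA a;
      fillHits(a);             // producer-owned storage
      HitsSoA b = makeHits();  // algorithm-owned, returned by value
      return 0;
    }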

int16_t * iph = &hits.iphi(0);
float * xl = &hits.xLocal(0); float * yl = &hits.yLocal(0);
float * xe = &hits.xerrLocal(0); float * ye = &hits.yerrLocal(0);
int16_t * xs = &hits.clusterSizeX(0); int16_t * ys = &hits.clusterSizeY(0);

@makortel

Indentation is off.

@VinInn (Author)

Yeah, it was just copied from the signature.
Will fix and beautify.

siPixelRecHitsLegacyPreSplitting = cms.VPSet(
cms.PSet(type = cms.string("SiPixelRecHitedmNewDetSetVector"))
)
)

@makortel

IIUC, in the RecHit case there is no need for an alias, as SiPixelRecHitFromSOA and the legacy SiPixelRecHitConverter produce the same products, so this could simply be

    cuda = _siPixelRecHitFromSOA.clone()

(after moving the corresponding import above this line)

@VinInn (Author)

OK, thanks. I had just tried to make it work while waiting for your explanation of what all of that means.

@@ -68,13 +70,19 @@ PixelCPEFast::PixelCPEFast(edm::ParameterSet const & conf,

const pixelCPEforGPU::ParamsOnGPU *PixelCPEFast::getGPUProductAsync(cuda::stream_t<>& cudaStream) const {
const auto& data = gpuData_.dataForCurrentDeviceAsync(cudaStream, [this](GPUData& data, cuda::stream_t<>& stream) {

std::cout << "coping pixelCPEforGPU" << std::endl;
//here or above???

@makortel

If you want it to print out when the transfer is initiated, this is the correct place. ("Above", outside of the lambda, would print every time the product is asked for.)

@VinInn (Author)

Yes, indeed. I just wanted to make sure it was really done once.
(I was planning to fill in some constant stuff here, but that does not work across libraries.)
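
For illustration, a simplified CPU-only sketch of the point made above: the filler callback runs only when the data is first requested (i.e. when the transfer would be initiated), while code outside it runs on every request. This is a toy stand-in, not the actual dataForCurrentDeviceAsync() implementation:

    #include <functional>
    #include <iostream>
    #include <optional>

    // Toy per-device cache: the filler callback runs only on the first request.
    template <typename T>
    class OncePerDevice {
    public:
      const T& dataAsync(const std::function<void(T&)>& fill) {
        if (!data_) {  // first request: "initiate the transfer"
          data_.emplace();
          fill(*data_);
        }
        return *data_;
      }

    private:
      std::optional<T> data_;
    };

    struct ParamsOnGPU {
      float cut = 0.f;
    };

    int main() {
      OncePerDevice<ParamsOnGPU> cache;
      for (int event = 0; event < 3; ++event) {
        std::cout << "asking for the product" << std::endl;  // printed on every request ("above")
        const auto& params = cache.dataAsync([](ParamsOnGPU& p) {
          std::cout << "copying pixelCPEforGPU" << std::endl;  // printed once ("here")
          p.cut = 0.1f;
        });
        (void)params;
      }
      return 0;
    }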

@fwyzard added the Pixels (Pixels-related developments) label, Apr 16, 2019

fwyzard commented Apr 30, 2019

This PR should be superseded by #324 / #329, apart from #324 (comment).

@fwyzard closed this, Apr 30, 2019
Labels: Pixels (Pixels-related developments)
9 participants